Generalizable disease detection using model ensemble on chest X-ray images

In the realm of healthcare, the demand for swift and precise diagnostic tools has been steadily increasing. This study presents a comprehensive performance analysis of three pre-trained convolutional neural network (CNN) architectures: ResNet50, DenseNet121, and Inception-ResNet-v2. To ensure the broad applicability of our approach, we curated a large-scale dataset comprising a diverse collection of chest X-ray images that included both positive and negative cases of COVID-19. The models’ performance was evaluated using separate datasets for internal validation (from the same source as the training images) and external validation (from different sources). Our examination uncovered a significant drop in accuracy between internal and external validation: a 10.66% reduction for ResNet50, a 36.33% decline for DenseNet121, and a 19.55% decrease for Inception-ResNet-v2. The best individual results were obtained with DenseNet121, which achieved the highest accuracy (96.71%) in internal validation, and Inception-ResNet-v2, which attained 76.70% accuracy in external validation. Furthermore, we introduced a model ensemble approach aimed at improving network performance when making inferences on images from diverse sources beyond their training data. The proposed method uses uncertainty-based weighting, calculating the entropy of each network's output in order to assign it an appropriate weight. Our results showcase the effectiveness of the ensemble method in enhancing accuracy up to 97.38% for internal validation and 81.18% for external validation, while maintaining a balanced ability to detect both positive and negative cases.

• Creating a robust COVID-19 detection model through transfer learning on pre-trained CNNs from ImageNet.
• Assessing the model's generalization on diverse internal and external validation sets, validating its ability to generalize across different datasets.
• Introducing a novel entropy technique to weigh model outputs, striving for a more accurate overall result when combining the models.
We work under the assumption that training with a comprehensive dataset covering all possible medical images worldwide is impractical. Instead, we acknowledge that models available for use have been trained on datasets that differ from those specific to individual hospitals. The core idea is that combining various models can offer an enhanced solution, addressing the variability in image datasets encountered across different healthcare facilities. This research not only serves as a proof of concept for streamlining the medical image classification process but also contributes to the advancement and fortification of these methodologies within the healthcare sector.

Materials and methods
The following section outlines the datasets and methods used in this research.
No metadata is associated with the images in this database.
• The COVIDGR dataset 23 is a curated collection of chest X-ray images annotated with findings related to COVID-19, and contains 426 positive cases and 426 negative cases. Positive cases have accompanying metadata indicating the severity of the illness on a scale ranging from severe to moderate, mild, and normal-PCR+.
• The Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification dataset 24 is a publicly available collection of chest X-ray and OCT images. The chest X-rays were obtained from the University of California San Diego and are labeled as either "normal" or "pneumonia" to indicate the presence or absence of the disease. The total dataset comprises 1583 normal and 4273 pneumonia images. For this study, which aims to differentiate between COVID-19+ and COVID-19− images, only the images labeled "normal" were used. No metadata is associated with the images in this database.
Table 1 provides information on the data sources and database division for the training, internal validation, and external validation groups. These groups comprised 13,534 COVID-19+ and 12,513 COVID-19− images for training, 1294 COVID-19+ and 1382 COVID-19− images for internal validation, and 1792 COVID-19+ and 2306 COVID-19− images for external validation. The external validation dataset comprised images from a number of sources, one of which was COVIDGR 23. From this source, a total of 426 images of positive cases were utilized, with severity data available on the Severe-Moderate-Mild-Normal-PCR+ scale: 79 Severe cases, 171 Moderate cases, 100 Mild cases, and 76 Normal-PCR+ cases.
Table 1. Summary of the datasets used in the research.

Study design
Five main steps were followed:
1. All the images were collected and grouped by source and category (positive and negative).
2. The dataset was divided into three sets: training, internal validation, and external validation. Because no metadata was available for in-depth analysis, dataset preparation focused on ensuring balanced classes and on avoiding overlap between the image sources of the training and internal validation sets and those of the external validation set. Images of the same subject were always kept in the same set. Due to these constraints, the percentages for each class may deviate slightly from the intended values of 75% for training, 10% for internal validation, and 15% for external validation.
3. Transfer learning was applied to three networks pre-trained on ImageNet.
4. The models' performance was assessed using both internal and external datasets. Internal validation refers to using images from the same sources as the training images, while external validation involves using images from different sources.
5. The outputs of all the models were combined to obtain a joint solution.
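The splitting constraints of step 2 can be illustrated with a minimal sketch. It is not the authors' code: the record fields (`source`, `subject`), the function name, and the default training fraction (roughly 75/85 of the non-external images) are assumptions made for illustration, and class balancing is left out for brevity.

```python
import random
from collections import defaultdict

def split_by_source_and_subject(records, external_sources,
                                train_fraction=0.88, seed=0):
    """Split image records into training, internal and external validation sets.

    records: list of dicts with hypothetical keys "path", "source", "subject".
    external_sources: set of source names reserved for external validation.
    """
    # Every image from a reserved source goes to external validation only.
    external = [r for r in records if r["source"] in external_sources]

    # Group the remaining images by (source, subject) so that one subject
    # never contributes images to both training and internal validation.
    by_subject = defaultdict(list)
    for r in records:
        if r["source"] not in external_sources:
            by_subject[(r["source"], r["subject"])].append(r)

    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_train = int(round(train_fraction * len(subjects)))

    train = [r for s in subjects[:n_train] for r in by_subject[s]]
    internal = [r for s in subjects[n_train:] for r in by_subject[s]]
    return train, internal, external
```

Splitting at the subject level rather than the image level is what keeps near-duplicate images of one patient from leaking between training and internal validation.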
Figure 1 shows the project workflow.

Model training
The study used three pre-trained CNN architectures, namely Inception-ResNet-v2 (IRV2) 33, ResNet50 34, and DenseNet121 35, all of which were originally trained on the ImageNet dataset. The selection of the networks was driven not by their distinctiveness but by their widespread use in image classification 19,36,37. Opting for these architectures, instead of more sophisticated alternatives, was intended to streamline reproducibility and make the experiments easier to understand, ultimately emphasizing the inherent difficulty of generalizing the models. Importantly, the proposed pipeline remains flexible and does not preclude the use of other pre-trained models. To apply transfer learning, all layers in the CNNs were frozen, and a classifier was added to the top of each network.
All input images were in either png, jpg, or jpeg format and were preprocessed by normalizing their pixel values to between 0 and 1. The images were also resized to the standard 256 × 256 × 3 pixels using bilinear interpolation, with the same image repeated in all colour channels. This resizing approach calculates pixel values in the resized image through linear interpolation, referencing surrounding pixel values from the original image. This image size was chosen to strike a balance between model accuracy and computational efficiency 38,39.
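The preprocessing described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the pipeline actually used: it assumes 8-bit grayscale input (hence the division by 255) and a pixel-centre sampling convention, and the function names are hypothetical. A real pipeline would normally rely on an image library instead.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for y in range(out_h):
        # Map the output pixel centre back into input coordinates (clamped).
        src_y = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(src_y)
        y1 = min(y0 + 1, in_h - 1)
        fy = src_y - y0
        row = []
        for x in range(out_w):
            src_x = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            fx = src_x - x0
            # Interpolate horizontally on the two neighbouring rows, then vertically.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def preprocess(gray_image, size=256):
    """Scale 8-bit pixels to [0, 1], resize, and repeat the channel three times."""
    scaled = [[p / 255.0 for p in row] for row in gray_image]  # assumes 0..255 input
    resized = bilinear_resize(scaled, size, size)
    return [[[p, p, p] for p in row] for row in resized]
```

Calling `preprocess(image)` on a list-of-rows grayscale image returns a 256 × 256 × 3 nested list with values between 0 and 1, matching the input shape expected by the networks.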
To construct the classifier, a series of layers were added to the pre-trained CNN architectures. These included a global average pooling layer and three fully connected (FC) layers with 128 (FC1-Dense), 64 (FC2-Dense), and 16 nodes (FC3-Dense), respectively, each with ReLU activation. A dropout layer with a rate of 0.3 was added after each fully connected layer to prevent overfitting, and the 2-node dense output layer was activated by the softmax function.
The global average pooling layer computes the average of each feature map in the final convolutional layer, giving a fixed-length vector for each image. This vector was then fed into the subsequent layers. The fully connected layers performed a series of linear transformations on the input data and the ReLU activation function was applied to introduce non-linearity. The dropout layer randomly eliminated some nodes to prevent overfitting. Finally, the softmax function was applied to the output dense layer to predict probabilities for each class. These layers worked together to transform the CNN output into a probability distribution over the two classes.
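As a toy illustration of the two operations at either end of the classifier, the following sketch computes global average pooling and softmax on plain Python lists. The function names and the tiny inputs are illustrative assumptions; the fully connected and dropout layers in between are omitted.

```python
import math

def global_average_pool(feature_maps):
    """Average each 2-D feature map to one number, giving a fixed-length vector."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical 2 x 2 feature maps become a vector of length 2.
pooled = global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 0]]])
# Two hypothetical output scores become class probabilities.
probabilities = softmax([2.0, 0.5])
```

Whatever the spatial size of the feature maps, the pooled vector has one entry per map, which is why the same classifier head can sit on top of each of the three backbones.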
The models were trained for 50 epochs, with a batch size of 128 and an Adam optimizer with a learning rate of 10⁻⁴. To prevent overfitting during training, the regularization technique employed was early stopping, where training was stopped on the criterion of a significant increase in loss. The CNN architectures and associated layers were selected and optimized to achieve accurate, efficient classification of images into two classes. Figure 2 illustrates the transfer learning architecture.
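The early stopping criterion is described only qualitatively, so the sketch below shows one plausible reading of it. The `tolerance` and `patience` values, the function name, and the choice of which loss is monitored are all assumptions, not settings reported in this study.

```python
def early_stopping_epoch(losses, max_epochs=50, tolerance=0.05, patience=3):
    """Return the 1-based epoch after which training would stop.

    Training stops once the monitored loss has exceeded the best loss seen so
    far by more than `tolerance` for `patience` consecutive epochs, or when
    `max_epochs` (or the end of the loss history) is reached.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best = loss
        if loss > best + tolerance:
            bad_epochs += 1  # loss is significantly above the best value
            if bad_epochs >= patience:
                return epoch
        else:
            bad_epochs = 0
    return min(len(losses), max_epochs)
```

Requiring several consecutive bad epochs, rather than a single one, avoids stopping on a momentary fluctuation of the loss.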

Model ensemble
In this study, we applied uncertainty-based weighting and entropy calculation to weight the outputs of different networks. Uncertainty-based weighting is a technique that aims to improve the accuracy of ensemble models by assigning different weights to each model output based on its level of uncertainty 40–42. In this case, entropy is used as a measure of uncertainty, with higher entropy indicating greater uncertainty in the model's predictions. This weighting technique involves calculating the entropy of each model's prediction for a given input data point x. The entropy H(O_i(x)) of each model i is calculated using Eq. (1), where p_i(j) is the predicted probability of class j for model i, and c is the total number of classes. In our case, c = 2, as j can take two values: 1 (COVID-19−) or 2 (COVID-19+). Here, p_i(j = 1) is the predicted probability that the data point x belongs to the COVID-19− class according to model i. Conversely, p_i(j = 2) is the predicted probability that image x belongs to the COVID-19+ class according to model i.
The negative exponentials of the entropies, exp(−H(O_i(x))), of all models are then summed to obtain the denominator for the weight calculation using Eq. (2), where m is the total number of models. As three different models are used (ResNet50, DenseNet121 and IRV2), m = 3.
The weight w_i for each model i is calculated using Eq. (3), in which the negative exponential of the entropy of model i (Eq. 1) is divided by the sum of the negative exponentials of the entropies of all models (Eq. 2).
Finally, the total weighted output O(x) for each class j is calculated using Eq. (4) as the sum over all models of w_i p_i(j), where w_i is the weighting factor for model i and p_i(j) is the predicted probability of class j for model i.
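Eqs. (1) to (4) are not reproduced in this text. Based on the verbal definitions given for them, a reconstruction consistent with the description would be the following; the logarithm base and the symbol S(x) for the denominator are assumptions.

```latex
\begin{align}
H\big(O_i(x)\big) &= -\sum_{j=1}^{c} p_i(j)\,\log p_i(j) \tag{1}\\
S(x) &= \sum_{k=1}^{m} e^{-H(O_k(x))} \tag{2}\\
w_i &= \frac{e^{-H(O_i(x))}}{S(x)} \tag{3}\\
O(x)_j &= \sum_{i=1}^{m} w_i\, p_i(j), \qquad j = 1,\dots,c \tag{4}
\end{align}
```

By construction the weights w_i are positive and sum to 1, so the combined output remains a probability distribution over the c classes.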
Using uncertainty-based weighting with entropy calculation, we can exploit the strengths of different models, thus improving the overall performance of the ensemble model. This technique also helps reduce the impact of outliers or poorly performing models, as their weights are lower due to their higher level of uncertainty. Furthermore, the use of entropy provides a mathematically rigorous method for measuring uncertainty, which can be particularly useful in complex or high-dimensional data.
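The weighting scheme can be sketched in a few lines of Python. This is a minimal illustration of the description in this section, not the authors' implementation: the natural logarithm is an assumption, the function names are hypothetical, and the three softmax outputs are invented for the example.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log, an assumption) of one model's class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(model_probs):
    """Weight each model by exp(-entropy), normalised so the weights sum to 1."""
    scores = [math.exp(-entropy(p)) for p in model_probs]
    total = sum(scores)
    return [s / total for s in scores]

def ensemble_output(model_probs):
    """Weighted sum of the per-model class probabilities for one image."""
    weights = entropy_weights(model_probs)
    n_classes = len(model_probs[0])
    return [sum(w * p[j] for w, p in zip(weights, model_probs))
            for j in range(n_classes)]

# Hypothetical softmax outputs [P(COVID-19-), P(COVID-19+)] from three models.
outputs = [[0.95, 0.05], [0.40, 0.60], [0.55, 0.45]]
weights = entropy_weights(outputs)
combined = ensemble_output(outputs)
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

In this example the first model is the most confident, so it receives the largest weight and pulls the combined prediction towards its own class. Note that the weights are recomputed for every image, so a model can dominate on one input and be down-weighted on the next.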

Evaluation metrics
Various metrics were employed to assess the model's performance. These included accuracy, sensitivity/recall, specificity, precision, F1 score, and area under the curve (AUC). The underlying outcomes are denoted as follows: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP refers to a subject with COVID-19 who tests positive; TN denotes a subject who does not have the disease and tests negative. FP corresponds to a subject who does not have COVID-19 but tests positive, and FN denotes a subject who has COVID-19 but tests negative. Sensitivity, as shown in Eq. (5), is particularly noteworthy. A classifier with 100% sensitivity correctly identifies all positive cases with the disease, which is crucial for detecting severe illnesses.
In addition to sensitivity, the study also assessed the specificity of the model, which measures the proportion of actual negative cases that the model correctly identifies. Specificity is calculated using Eq. (6). The accuracy of the model was also evaluated. Accuracy is a widely used parameter in evaluating classifier performance and provides an overall assessment of the model's effectiveness. It is defined using Eq. (7). Precision and the F1 score, as indicated in Eqs. (8) and (9), were also calculated to assess the model's performance. Precision indicates the proportion of cases predicted as positive that are truly positive. The F1 score is a statistical measure that combines the model's precision and recall and yields a value between 0 and 1. For all metrics, 95% confidence intervals (CI) were calculated. Additionally, a two-tailed t-test was conducted to compare the performance of the proposed ensemble method with that of each of the other classifiers. The null hypothesis (H0) states that there is no significant difference between the means of the two models. A p value below 0.05 was considered statistically significant, providing sufficient evidence to reject the null hypothesis.
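The metrics referred to in Eqs. (5) to (9) follow the standard confusion-matrix definitions, which can be written out as a short sketch. The function name and the example counts are illustrative assumptions; AUC and the confidence intervals are not covered here, and the function assumes that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # recall: positives correctly detected
    specificity = tn / (tn + fp)                # negatives correctly detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct predictions
    precision = tp / (tp + fp)                  # predicted positives that are real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 100 diseased and 100 healthy subjects.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

For these counts the sensitivity is 0.80, the specificity 0.90 and the accuracy 0.85, which shows how a single accuracy figure can hide an imbalance between the two error types.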

Performance on internal validation dataset
In the first experiment, we used the internal validation set to evaluate the performance of the three networks alone. The confusion matrix obtained for each network is shown in Fig. 3. A more detailed analysis of the corresponding data is provided in Table 2. The best results were obtained using DenseNet121, achieving an accuracy of 96.71%, precision of 96.82%, sensitivity of 96.37%, specificity of 97.03%, F1 score of 96.59% and AUC of 96.70%.
www.nature.com/scientificreports/

Performance on external validation dataset
In the second experiment, we used an external dataset comprising images taken from different sources to those used for training or internal validation.The confusion matrix obtained for each network is shown in Fig. 4. A more detailed analysis of the corresponding data is provided in Table 3.
The models' performance shows a notable decline in this scenario. ResNet50 yielded the best results, with an accuracy of 78.38%, precision of 78.93%, specificity of 85.69%, F1 score of 78.15% and AUC of 77.33%. In terms of sensitivity, DenseNet121 achieved the best result at 82.91%.
Furthermore, we examined the effectiveness of COVID-19 detection by severity level, analyzing images from the COVIDGR 23 dataset. The TP percentages for each class are presented in Table 4. Overall, the models identified more severe cases more accurately but faced challenges in classifying milder cases.
On analyzing the origin of the images, 70.44% of the images classified as FN were found to belong to the COVIDGR database 23 , and 53.59% of the images classified as FP belonged to the covid-chestxray-dataset 25 (one of the 8 data sources making up the COVIDx CXR-3 dataset).This source contains samples from patients who have tested positive or are suspected of having COVID-19 and samples from patients with other viral and bacterial

Model ensemble
The confusion matrix of the ensemble model and the comparison between each individual model's performance and the ensemble are shown in Fig. 4 and Table 3.
The combination of models demonstrates improved classification for cases of both COVID-19+ and COVID-19−. Regarding internal validation, the model ensemble enhances the results obtained by individual networks, achieving an accuracy of 97.38% and AUC of 97.35%, as shown in Table 2. During external validation, certain individual metrics, such as sensitivity and specificity, were higher for other models, as shown in Table 3. However, these models exhibited weaknesses in other areas; for instance, ResNet50 achieved a specificity of 85.69% (p < 0.05), but its sensitivity was only 68.97% (p < 0.05), whereas DenseNet121 attained a sensitivity of 82.91% (p < 0.05), but its specificity dropped to 44.89% (p < 0.05). By contrast, the proposed model ensemble achieved a balanced solution, yielding a sensitivity of 80.97% and a specificity of 81.31%. These values correspond to the highest overall accuracy (81.16%) and AUC (81.14%).
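As a consistency check on these figures, overall accuracy can be recovered from sensitivity, specificity and the class counts of the external validation set (1792 COVID-19+ and 2306 COVID-19− images, as reported in this article). The helper name below is illustrative.

```python
def accuracy_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy implied by sensitivity, specificity and class counts."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)

# External validation figures quoted in this article.
ensemble_acc = accuracy_from_rates(0.8097, 0.8131, 1792, 2306)
resnet50_acc = accuracy_from_rates(0.6897, 0.8569, 1792, 2306)
print(f"ensemble: {ensemble_acc:.2%}, ResNet50: {resnet50_acc:.2%}")
# prints "ensemble: 81.16%, ResNet50: 78.38%"
```

Both values agree with the accuracies reported for the ensemble and for ResNet50 on the external validation set.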
Regarding the severity study, the results in Table 4 indicate that both individual models and the model ensemble have higher detection rates for cases labeled as severe than cases classified as mild or normal-PCR+.

Benchmarking ensembling models
To assess the robustness of our ensemble model, we first conducted a performance comparison with other commonly used ensemble models using our external validation dataset. Specifically, we chose soft-voting methods that involve averaging and weighted averaging. For weighted averaging, we adopted an approach where weights are generated randomly using a Dirichlet distribution 43. Additionally, we considered hard-voting methods based on majority voting. The findings in Table 5 reveal that, for the external validation dataset, the approach proposed in this article demonstrates superior overall performance compared to the other three methods. The only exception arises in the sensitivity measurement between averaging soft voting and the proposed ensemble method, where no statistically significant difference has been observed (p > 0.05).
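The three baseline ensembles can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the Dirichlet concentration `alpha=1.0` is an assumed value, and the Dirichlet sample is obtained by normalising independent Gamma draws from the standard library.

```python
import random

def average_soft_vote(model_probs):
    """Unweighted mean of the per-model class probabilities."""
    n = len(model_probs)
    return [sum(p[j] for p in model_probs) / n
            for j in range(len(model_probs[0]))]

def dirichlet_weights(n_models, rng, alpha=1.0):
    """Random weights from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_models)]
    total = sum(draws)
    return [d / total for d in draws]

def weighted_soft_vote(model_probs, weights):
    """Weighted mean of the per-model class probabilities."""
    return [sum(w * p[j] for w, p in zip(weights, model_probs))
            for j in range(len(model_probs[0]))]

def majority_hard_vote(model_probs):
    """Class predicted by most models (each model votes for its top class)."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in model_probs]
    return max(set(votes), key=votes.count)

# Hypothetical softmax outputs [P(COVID-19-), P(COVID-19+)] from three models.
outputs = [[0.95, 0.05], [0.40, 0.60], [0.45, 0.55]]
soft = average_soft_vote(outputs)
hard = majority_hard_vote(outputs)
random_weighted = weighted_soft_vote(outputs, dirichlet_weights(3, random.Random(0)))
```

In this example soft voting favours the first class because one model is very confident, while majority voting picks the second class because two of the three models lean that way. The methods can therefore disagree on the same image.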

Comparison to the state-of-the-art results
We compared our proposed ensemble approach with various state-of-the-art ensemble models applied to COVID-19 classification. Table 6 presents the results of these studies along with the methodologies employed and the type of validation performed.
Among the 14 studies scrutinized in the ensemble methods comparison for COVID-19 detection within the state of the art, merely 4 conducted external validation using images from sources distinct from those used in internal training and validation. Of these 4 studies, only 2 utilized more than one network and demonstrated results surpassing those of our model. In the first case, Deb et al. 18 implemented feature concatenation for four different models and assessed them using an external database comprising 92 images, with 29 belonging to the COVID-19+ class. They achieved an accuracy of 93.48% for the classification of 3 classes and 95.65% for binary classification. The number of images used in this study for external validation is considerably limited compared to our study, which involved a more extensive dataset comprising 4098 images, including 1792 COVID-19+ cases from four distinct sources.
In the second case, Wehbe et al. 19 employed a weighted average ensemble with 6 different models for binary classification. They evaluated these models on an external database containing 2214 images, of which 1192 were COVID-19+ and originated from a single source. The results exhibited an accuracy gain of 0.84% compared to our method, along with an increase of 6.86% in AUC, 11.69% in specificity and a decrease of 9.97% in sensitivity. In our comparison with commonly used ensemble models, we applied the same methodology as presented in this article when comparing weighted averaging. Notably, in our case, the performance of the proposed ensemble method remains statistically significantly superior to the weighted averaging approach as seen in Table 5.

Discussion
This study compared the performance of three pre-trained neural networks on an internal validation and an external validation dataset. Results showed that the models performed exceptionally well on the internal validation dataset, where the images are from the same source as the training dataset. DenseNet121 achieved the highest AUC (96.70%) on the internal validation dataset.
However, when we tested the same models on the external validation dataset, which contains images from a different source, performance dropped significantly. ResNet50 attained the highest AUC on the external validation dataset, reaching 77.33%.
Combining the output of the models has demonstrated improved classification performance, with AUC for the internal validation dataset rising to 97.35%, and external validation rising to 81.14%. This study used 3 models as proof of concept to demonstrate the contribution of network ensemble. However, this methodology can be extrapolated to a larger number of networks to achieve more robust results.
Additionally, the results of the t-test, which compares the performance of the ensemble model against each individual network, indicate that, in the case of internal validation, the ensemble outperforms the IRV2 and ResNet50 networks statistically. For DenseNet121, no significant differences are observed, except in precision and specificity values, where our ensemble shows better performance with p < 0.05. Regarding external validation, the proposed ensemble has demonstrated significantly higher accuracy, F1 score, and AUC compared to each individual network.
Regarding the severity analysis, the results in Table 4 reveal that the proposed ensemble of models is not the most suitable for detecting COVID-19+ cases for the severity labels specified. The number of images in the dataset containing severity metadata is relatively small, which may limit the generalizability of the findings. Furthermore, with so few samples per category, small changes in the number of correctly classified images translate into large changes in the percentages, which limits the conclusions that can be drawn. Nevertheless, it is worth noting that there is a noticeable tendency to classify severe cases with greater accuracy.
To highlight the robustness of our ensemble methodology, a performance comparison was conducted with commonly used methods in the literature, such as soft-voting and hard-voting. The results demonstrated that our proposed ensemble achieves the best outcomes. Thus, by combining the strengths and mitigating the weaknesses of individual models, a global model was developed that significantly enhances performance. This research not only serves as a proof of concept for streamlining the medical image classification process but also contributes to the advancement and fortification of these methodologies within the healthcare sector. Furthermore, the exploration of combining results from networks trained under diverse circumstances underscores the potential to improve overall performance, particularly when confronted with data unfamiliar to any of the individual networks.
On analyzing the results of the external validation dataset, we noted two factors that may influence network performance.
First, on the COVIDGR 23 source, the models were highly effective at detecting severe cases of COVID-19, but accuracy was lower for milder cases. These findings suggest that the models perform well when diagnosing severe cases, but may require further improvements to accurately detect milder cases. This also highlights the difficulty of confirming COVID-19 using other techniques such as polymerase chain reaction (PCR) testing, as well as potential bias stemming from false positives.
Second, when images of other pathologies similar to COVID-19 were included, this affected the model's performance.
One of the major limitations of this study is the lack of metadata. Many of the currently available public databases contain no metadata for their medical images. This drawback makes it difficult to translate current models into clinical applications. This research aimed to generate a database sufficiently representative of positive and negative COVID-19 cases. However, characterizing the variety of cases requires additional data such as age, sex, subject positioning, severity of the disease or other pathologies present.

Conclusion
We presented a domain adaptation study applied in the context of COVID-19 detection using chest X-ray images. The study used 26,047 images from 6 different data sources to fine-tune 3 pre-trained networks: IRV2, ResNet50, and DenseNet121. For the internal validation of the models, 2676 images from the same 6 data sources used in training were employed. External validation of the models used 4098 images from 4 different sources. Evaluation of the models revealed promising results in the internal validation set, showcasing accuracies ranging from 87 to 95%. However, these performance levels declined significantly on the external dataset, with accuracies ranging from 61 to 78%. This contrast underscores the critical importance of assessing machine learning models across diverse datasets to verify that their performance is both robust and generalizable.
To improve on the individual performance of the models, results from the 3 networks were combined by taking the weighted average of the output of the nodes, taking into account their entropy. This resulted in a balanced network that can detect both positive and negative cases with an accuracy of 81.16%, sensitivity of 80.97%, and specificity of 81.31% on external datasets. It is worth noting that these results present an important step forward toward utilizing a computer-based solution, with near real-time capabilities, compared to the time-intensive assessments carried out by expert clinicians.
Future research should include more models and investigate other methods for weighting networks, aiming at more precise results in the detection of COVID-19, and should apply the approach to other domains. Additionally, deeper analysis leveraging metadata could provide insights into the limitations of the current study. These considerations contribute to a comprehensive understanding of the model's applicability and potential refinements for broader applications across various domains.
The training, internal validation, and external validation groups comprised 13,534 COVID-19+ and 12,513 COVID-19− images for training, 1294 COVID-19+ and 1382 COVID-19− images for internal validation, and 1792 COVID-19+ and 2306 COVID-19− images for external validation. The absence of metadata underscores the importance of carefully selecting an external validation dataset, ensuring that the source of the images differs from those used in internal validation or training. It is crucial to highlight that this divergence involves images originating from different hospitals, each utilizing various imaging acquisition machines. Additionally, care was taken to ensure the proper calibration of both positive and negative cases.

Figure 2. Flowchart of proposed transfer learning model.

Figure 3. Confusion matrix of transfer learning models on the internal dataset. 32

Table 2. Comparing internal validation results for the proposed ensemble model against transfer learning models. 95% CI is represented as [lower bound-upper bound]. Significant values are in [bold]. * denotes a statistically significant difference (p < 0.05) when comparing against the proposed assembling method.

Table 4. Analysis of true positive (TP) percentages in COVID-19 detection based on severity levels using the COVIDGR dataset. Significant values are in [bold].

Table 5. Comparing performance of external validation for the proposed ensemble model against other ensemble methods. 95% CI is represented as [lower bound-upper bound]. Significant values are in [bold]. * denotes a statistically significant difference (p < 0.05) when comparing against the proposed assembling method.

Table 6. Comparing state-of-the-art results obtained from published ensemble methods for COVID-19 detection.